In this report I highlight the firms our supervisors should focus on, using key metrics that I explain throughout. I have kept all my code in the report for audit and reusability purposes, as conclusions may change when new data arrives. I focus on the three main characteristics stated: firm size, changing business profile and outliers. I then apply machine learning to the data using principal component analysis and k-means clustering, and conclude with the benefits this report would gain from being hosted on a cloud server.
I have used a combination of the openxlsx and tidyverse packages to read the xlsx file into an R-friendly format. Below I have produced a summary of all the variables.
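As a minimal sketch of this ingestion step (the real filename is not shown here, so the example writes and reads back a tiny hypothetical table instead):

```r
library(openxlsx)

# Hypothetical round trip: write a small example table to xlsx and read it
# back, mirroring how the real submission workbook is ingested.
tmp <- tempfile(fileext = ".xlsx")
example_df <- data.frame(firm = c("Firm 1", "Firm 2"), gwp_m = c(9.35, 15.47))
openxlsx::write.xlsx(example_df, file = tmp)

raw_df <- openxlsx::read.xlsx(tmp)
summary(raw_df)
```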
| firm | year | value_type | eof_for_scr_m | equity_m | gwp_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | nwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | pure_gross_claims_ratio | pure_net_claims_ratio | scr_m | scr_coverage_ratio | total_assets_m | total_liabilities_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:9120 | Length:9120 | Length:9120 | Min. : -162.25 | Min. : -123.610 | Min. : -19.78 | Min. : -167.82 | Min. :-189.936 | Min. : -14064 | Min. :-16112.4 | Min. :-17754.10 | Min. : -182.382 | Min. :-5468706 | Min. :-4495420 | Min. :-24817.8 | Min. :-33721.0 | Min. : 0.000 | Min. : -3 | Min. : -207.4 | Min. : -1487.1 |
| Class :character | Class :character | Class :character | 1st Qu.: 5.98 | 1st Qu.: 5.798 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 1.794 | 1st Qu.: 1 | 1st Qu.: 14.4 | 1st Qu.: 3.8 |
| Mode :character | Mode :character | Mode :character | Median : 32.09 | Median : 34.276 | Median : 15.47 | Median : 10.48 | Median : 3.392 | Median : 0 | Median : 0.1 | Median : 7.95 | Median : 4.502 | Median : 0 | Median : 0 | Median : 0.2 | Median : 0.1 | Median : 14.533 | Median : 2 | Median : 113.0 | Median : 63.0 |
| NA | NA | NA | Mean : 455.68 | Mean : 480.865 | Mean : 853.94 | Mean : 259.07 | Mean : 116.696 | Mean : 1859 | Mean : 273.7 | Mean : 689.12 | Mean : 164.284 | Mean : 172 | Mean : 1370 | Mean : 463.0 | Mean : 388.1 | Mean : 308.437 | Mean : 1301158 | Mean : 6333.0 | Mean : 6363.4 |
| NA | NA | NA | 3rd Qu.: 165.65 | 3rd Qu.: 176.937 | 3rd Qu.: 198.96 | 3rd Qu.: 159.34 | 3rd Qu.: 79.463 | 3rd Qu.: 1 | 3rd Qu.: 0.3 | 3rd Qu.: 125.95 | 3rd Qu.: 93.771 | 3rd Qu.: 1 | 3rd Qu.: 0 | 3rd Qu.: 0.5 | 3rd Qu.: 0.5 | 3rd Qu.: 95.191 | 3rd Qu.: 3 | 3rd Qu.: 955.4 | 3rd Qu.: 655.7 |
| NA | NA | NA | Max. :41636.30 | Max. :26705.042 | Max. :74078.64 | Max. :19292.07 | Max. :6844.014 | Max. :3978266 | Max. :575064.8 | Max. : 75526.67 | Max. :11351.609 | Max. : 2445529 | Max. : 2691212 | Max. :985011.3 | Max. :828809.4 | Max. :22788.359 | Max. :999302877 | Max. :553549.9 | Max. :494498.5 |
| NA | NA | NA | NA's :2620 | NA's :2620 | NA's :2620 | NA | NA | NA | NA | NA's :2620 | NA | NA | NA | NA | NA | NA's :2620 | NA's :2620 | NA's :2620 | NA's :2620 |
From this summary we can see that each column is of the correct data type and that there is significant variance across the numeric variables. The extreme outliers will influence the statistical metrics I use in my analysis, so I will need more variance-robust measures. The dataset now looks like this:
| firm | year | value_type | eof_for_scr_m | equity_m | gwp_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | nwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | pure_gross_claims_ratio | pure_net_claims_ratio | scr_m | scr_coverage_ratio | total_assets_m | total_liabilities_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Firm 1 | 2016 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |
| Firm 1 | 2016 | median_value | 0.0000 | 0.0000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.0000000 | 0.00 | 0.000 |
| Firm 1 | 2016 | sd_value | 1080.2952 | 555.8116 | 20.915299 | 0.2634062 | 0.0024665 | 0.3018918 | 34.54781 | 7939.875 | 4.424999 | 31.88581 | 18.341598 | 7.841752 | 7.685056 | 903.9626 | 24993427.5609242 | 28476.09 | 17965.323 |
| Firm 1 | 2016 | time_series_value | 2416.5598 | 1243.9250 | 46.768030 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | -17754.100 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 2021.3219 | 0.1625197 | 63676.87 | 40171.683 |
| Firm 1 | 2017 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |
I have set up the data table so that only the time_series_value rows vary over time; the other value_type rows stay constant for each firm. This makes it straightforward to find the top/bottom n firms, or the most/least varying firms, by any variable, as you will see in my visualisation code.
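A sketch of the wrangling this layout enables: select the top n firms by a chosen metric using the median_value rows. The function name and the toy column set are illustrative, and `df` stands for the table shown above.

```r
library(dplyr)

# Pick the n firms with the highest median value for a given metric.
# The median_value rows are constant over time, so we keep one row per firm.
top_firms_by <- function(df, metric, n = 5) {
  df %>%
    dplyr::filter(value_type == "median_value") %>%
    dplyr::distinct(firm, .keep_all = TRUE) %>%
    dplyr::slice_max(order_by = .data[[metric]], n = n, with_ties = FALSE) %>%
    dplyr::pull(firm)
}
```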
There are some caveats worth considering with my data-centric approach to resource allocation:
To consider firm size I will look at:
As we want to consider both the magnitude and consistency of the variables, I will use the median score to select the highest-rated firms and plot them against the rest.
From this graph we can see there are five firms distinctly larger than the rest, even though I have only considered the top 20% of firms by median value. It is worth highlighting (and I will explore later) the volatility of some firms' accounts: Firms 311 and 210 show large drops in 2020 (YE).
Only Firms 210 and 4 are considerably larger than the rest of the top 20% of firms (for this variable).
Firm 4 has now appeared twice: first for having a significantly larger Net Written Premium and now for Equity. This suggests it is a highly profitable insurance firm with a large equity holding, placing it in a very healthy financial position. I would suggest this firm is an important player in the market and one to monitor.
| firm | size_score | nwp_m | net_bel_m | pure_gross_claims_ratio |
|---|---|---|---|---|
| Firm 188 | 582412.8404 | -127.462353 | 12.765106 | 1747353.2183443 |
| Firm 210 | 3024.4069 | 8971.023310 | 94.466754 | 7.7306263 |
| Firm 4 | 2342.8611 | 7052.279994 | 13.346004 | -37.0426236 |
| Firm 26 | 1880.1809 | 5495.078203 | 141.915499 | 3.5489176 |
| Firm 72 | 1476.9343 | 74.731447 | 1.834567 | 4354.2369564 |
| Firm 105 | 1308.1439 | 2920.323677 | 996.200227 | 7.9077405 |
| Firm 28 | 1258.6591 | 7.048297 | 5.674840 | 3763.2540889 |
| Firm 284 | 969.9516 | 12.125277 | 11.191485 | 2886.5379034 |
| Firm 25 | 639.4467 | 1652.469218 | 265.069667 | 0.8012794 |
| Firm 311 | 617.2461 | 1846.587772 | 4.626903 | 0.5234772 |
I have created a table here that provides a potential list of the top 10 firms, calculated as the mean modified z-score across Net Written Premium, Net BEL (inc. TPs as a whole, pre-TMTP) and Pure Gross Claims Ratio. I chose the modified z-score because it is more robust to extreme outliers than the standard z-score and gives a good idea of each firm's position relative to the population. I chose those three variables because, collectively, they are highly positively correlated with every variable in the dataset, so between them they roughly capture the information in the dataset. As you can see, this still needs to be read with some scepticism, as major outliers disrupt the results: Firm 188's Pure Gross Claims Ratio value, for example.
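The modified z-score can be sketched as below. I assume the conventional Iglewicz and Hoaglin form, which scales by the unscaled median absolute deviation; the extreme outliers in the summary above barely move either the median or the MAD, which is why it is preferred here over the standard z-score.

```r
# Modified z-score: 0.6745 * (x - median) / MAD, with MAD the raw
# median absolute deviation (no consistency constant applied).
modified_z <- function(x) {
  med <- stats::median(x, na.rm = TRUE)
  mad_raw <- stats::median(abs(x - med), na.rm = TRUE)
  0.6745 * (x - med) / mad_raw
}
```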
Volatility of a firm's accounts is of considerable concern to supervisors, so it is important to investigate.
I will start by considering variation at the metric level. I will use relative standard deviation (RSD), as this allows me to compare standard deviations across the variables effectively.
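The relative standard deviation is simply the standard deviation scaled by the absolute mean, so metrics reported in £m and ratio metrics can be compared on one scale; a one-line sketch:

```r
# Relative standard deviation: sd / |mean|, making dispersion comparable
# across metrics with very different magnitudes.
rsd <- function(x) {
  stats::sd(x, na.rm = TRUE) / abs(mean(x, na.rm = TRUE))
}
```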
This chart shows that RSD varies across metrics over time and that multiple metrics have high variance. Six metrics stand out with a high relative standard deviation, all of them ratios: gross_expense_ratio, net_expense_ratio, gross_combined_ratio, net_combined_ratio, pure_gross_claims_ratio and pure_net_claims_ratio. Although there are six, they form three highly correlated net/gross pairs. There are also strong relationships by definition: net_combined_ratio is \(\frac{\text{Net claims and expenses incurred}}{\text{Net earned premium}}\), while net_expense_ratio is just the expenses part of that equation, suggesting this could be expense-driven variance. We also see a massive spike in the RSD of gross_expense_ratio in 2016, which I suspect is misreporting, especially given there is no corresponding spike in net_expense_ratio; I will explore this later. I will look into the firms that produced these high variations in the net variables, as these will likely also explain the variations in the gross variables.
We can see from this graph that the net_expense_ratio spike has potentially come from two enormous reported values. I will add a filter to this chart and the other ratio charts to keep only median values between 0 and 0.87 (the upper bound being Q3 + 1.5 × IQR).
After filtering out misreporting firms, we can see the five firms whose data changed most substantially over 2016–20. This is important for supervisors to be aware of because of its potential to damage investor confidence in a firm and the firm's ability to conduct any reasonable long-term financial planning.
As the Net Combined Ratio indicates the profitability of a firm, this is a very important metric, and we can see multiple firms with volatile ratios. Firms 287, 291 and 417, which are in the top five most volatile (highest standard deviations) and have a high net combined ratio (above 1.5), would be a large cause for concern for supervisors, as this suggests they are very unprofitable and their future profitability is very difficult to forecast.
This is calculated as \(\frac{\text{Net claims}}{\text{Net earned premium}}\), so firms like 38 and 253, which have managed to reduce it from a poor position, are a positive story that should ease supervisors' concern. Firms 60, 142 and 306 are concerning for the reasons previously mentioned: high variance and currently very high proportionate costs.
I've shown a couple of techniques I have used to filter out outliers and highlight elements in the time series. I have chosen the interquartile range method due to the presence of extreme outliers in the data; the standard deviation is influenced much more strongly by extreme outliers. I have created a flagging system: if a data point is outside \((Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR)\) for the entire metric population then it is an 'IQR Outlier'; if it is outside the 2nd–8th decile range it is labelled a 'Decile Outlier'; otherwise it has 'No Flag'. I will provide an example below for Firm 12.
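The flagging system can be sketched as a single function over a metric's population. I read the decile band as the 2nd to 8th deciles (20th–80th percentiles); the function name and exact band are my assumptions.

```r
library(dplyr)

# Flag each observation against the whole population of a metric:
# Tukey fences at 1.5 * IQR first, then the softer decile band.
flag_outliers <- function(x) {
  q <- stats::quantile(x, probs = c(0.2, 0.25, 0.75, 0.8), na.rm = TRUE)
  iqr <- q[[3]] - q[[2]]
  dplyr::case_when(
    x < q[[2]] - 1.5 * iqr | x > q[[3]] + 1.5 * iqr ~ "IQR Outlier",
    x < q[[1]] | x > q[[4]]                         ~ "Decile Outlier",
    TRUE                                            ~ "No Flag"
  )
}
```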
We can see there is fair homogeneity in submissions and no systemic reporting issues towards any particular metric.
This approach allows users to flag submissions quickly: if we take Firm 139's and Firm 222's scr_m and nwp_m submissions, we can see that they have a range of flags throughout the series.
| firm | year | scr_m flag | nwp_m flag | scr_m value | nwp_m value |
|---|---|---|---|---|---|
| Firm 139 | 2016 | No Flag | IQR Outlier | 148.44436 | 700.60412 |
| Firm 139 | 2017 | No Flag | IQR Outlier | 102.00252 | 678.88901 |
| Firm 139 | 2018 | IQR Outlier | IQR Outlier | 384.08633 | 1013.57043 |
| Firm 139 | 2019 | No Flag | Decile Outlier | 102.85466 | 208.04965 |
| Firm 139 | 2020 | Decile Outlier | IQR Outlier | 137.17073 | 468.09469 |
| Firm 222 | 2016 | No Flag | IQR Outlier | 157.07521 | 223.59939 |
| Firm 222 | 2017 | No Flag | IQR Outlier | 92.17920 | 423.15529 |
| Firm 222 | 2018 | IQR Outlier | Decile Outlier | 286.40581 | 291.80749 |
| Firm 222 | 2019 | No Flag | No Flag | 66.80772 | 153.44594 |
| Firm 222 | 2020 | No Flag | No Flag | 78.87480 | 79.30072 |
If we graph that, we can see how this flagging approach works in practice:
This method allows us to compare each firm's submission against the rest of the population and decide whether we consider it an outlier. In this case we can safely assume the IQR Outliers will require a resubmission. Utilising IQR outlier markers gives a clear picture of where an individual firm's submissions deviate significantly from the population.
For the machine learning calculations, I am going to remove all IQR outliers to stop poor data quality impacting the outputs.
filtered_df = dplyr::anti_join(
x = df %>%
filter(value_type == 'time_series_value') %>%
pivot_longer(cols = -c('firm', 'year', 'value_type')),
y = outliers_df %>%
pivot_longer(cols = -c('firm', 'year')) %>%
filter(value == 'IQR Outlier'),
# remove any rows that match on firm, year and name
by = c('firm', 'year', 'name')
) %>%
# we are going to describe the whole time series, so we just need firm and metric
dplyr::group_by(firm, name) %>%
dplyr::summarise(
# this is the variable I will use (for now)
median_value = median(value, na.rm = TRUE)
# mad_value = mad(value, na.rm = TRUE)
) %>%
dplyr::filter(
!is.na(median_value)
) %>%
tidyr::pivot_wider(
names_from = name,
values_from = median_value) %>%
# remove any rows with NAs - this reduces the dataset but leads to cleaner results
stats::na.omit()
kable(head(filtered_df)) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%")
| firm | eof_for_scr_m | equity_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | gwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | nwp_m | pure_gross_claims_ratio | pure_net_claims_ratio | scr_coverage_ratio | scr_m | total_assets_m | total_liabilities_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Firm 1 | 0.00000 | 0.00000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.00000 | 0.000000 |
| Firm 100 | 24.31249 | 132.55980 | 45.90523 | 66.2708894 | 0.8583713 | 0.2994157 | 29.95186 | 102.0876430 | 0.2561299 | 0.3134123 | 22.37252 | 0.6318334 | 0.1521047 | 0.9063138 | 34.142618 | 406.01092 | 40.392517 |
| Firm 101 | 0.00000 | 0.00000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 1.4076342 | 0.000000 | 0.00000 | 0.000000 |
| Firm 102 | 0.00000 | 0.00000 | -16.43230 | 137.9733993 | 1.1148258 | 0.3940367 | 0.00000 | -8.0739364 | 1.2914292 | 0.6520511 | 0.00000 | 0.4760284 | 0.3427111 | 0.0000000 | 0.000000 | 0.00000 | 0.000000 |
| Firm 103 | 29.04979 | 22.21371 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.0000000 | 0.0000000 | 2.5955507 | 1.005624 | 64.46984 | 2.291058 |
| Firm 106 | 76.30481 | 182.72545 | 12.16381 | 0.0443066 | 0.0000000 | 0.0000000 | 117.99910 | 0.8681491 | 0.0000000 | 0.0000000 | 127.69550 | 0.0000000 | 0.0000000 | 0.9292053 | 55.887545 | 116.58011 | 0.000000 |
One of the strengths of the PRA's supervisory infrastructure is that we are able to categorise firms based on their size, business model, area of operation and so on. Each of these characteristics requires its own supervisory approach, which the PRA can adapt to, enabling more effective regulation. For my machine learning application I will use k-means clustering to put the firms into distinct categories. First I will extract the important information from the features using principal component analysis (PCA). PCA can identify the underlying structure of the data, which can make it easier for the clustering algorithm to find the true clusters. I have chosen this cleaning approach before clustering because, as a dimension-reduction process, it extracts the most important elements of each feature and reduces the dataset to a more comprehensible size. Given the multicollinearity problems shown at the start, this approach will mitigate that issue and improve cluster quality. With tidymodels we can easily create a pipeline from input to PCA output.
pca_recipe <- recipes::recipe(firm ~ ., data = filtered_df) %>%
# step_naomit(all_numeric()) %>%
# we only want to use complete columns
recipes::step_filter_missing(
all_numeric(),
threshold = 0) %>%
# normalize all columns
recipes::step_normalize(all_numeric()) %>%
# compute principal components
recipes::step_pca(
all_numeric(),
threshold = .95
)
pca_prep = recipes::prep(x = pca_recipe, training = filtered_df)
# plot this and cluster
pca_output_df = recipes::bake(object = pca_prep, filtered_df)
kable(head(pca_output_df)) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%")
| firm | PC01 | PC02 | PC03 | PC04 | PC05 | PC06 | PC07 | PC08 | PC09 | PC10 | PC11 | PC12 | PC13 | PC14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Firm 1 | 1.866716 | 1.9426909 | -1.0615485 | 0.3023471 | -0.0124155 | 0.0329275 | 0.0477477 | -0.3485232 | 0.0700634 | -0.1339448 | 0.1071496 | 0.0541784 | -0.0788456 | 0.0135459 |
| Firm 100 | -2.060271 | -0.5487977 | -0.4889875 | 0.3356351 | -0.5212740 | 0.5056186 | -0.7245970 | -0.5003068 | 0.5873495 | 0.4570771 | -0.9099582 | -0.2912519 | 0.3398024 | -0.7756781 |
| Firm 101 | 1.895210 | 1.9149604 | -0.1456495 | 0.0809902 | -0.1570126 | -0.1127410 | -0.0467271 | -0.1700240 | 0.0580823 | -0.0230969 | 0.0248370 | -0.0254098 | -0.0319223 | 0.0248264 |
| Firm 102 | -3.687353 | 0.9898153 | -1.2237018 | -2.5142850 | 0.5038193 | 0.6540177 | -1.3928494 | -1.7007434 | 1.2900024 | 0.0774827 | 0.0787834 | 0.7182910 | -0.5373128 | -0.0845482 |
| Firm 103 | 1.969936 | 1.5653151 | 0.7077100 | -0.0835912 | -0.2048694 | 0.0241829 | -0.0922690 | -0.0136316 | -0.0311975 | 0.0342323 | 0.0061932 | 0.0029417 | 0.0541043 | 0.0109580 |
| Firm 106 | 2.353587 | -1.6336331 | -0.2567017 | 1.1294112 | 1.5942964 | -0.6787836 | -0.5475134 | -0.4025357 | 0.3681808 | 0.4919000 | -0.0283763 | -0.4576122 | -0.3439228 | -1.1386102 |
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.411389 | 2.092669 | 1.035251 | 0.9175363 | 0.8528295 | 0.8201551 | 0.7019183 | 0.6781957 | 0.6407285 | 0.5959858 | 0.5921662 | 0.5266816 | 0.5214357 | 0.4804566 | 0.4783704 | 0.464964 | 0.4451537 |
| Proportion of Variance | 0.342050 | 0.257600 | 0.063040 | 0.0495200 | 0.0427800 | 0.0395700 | 0.0289800 | 0.0270600 | 0.0241500 | 0.0208900 | 0.0206300 | 0.0163200 | 0.0159900 | 0.0135800 | 0.0134600 | 0.012720 | 0.0116600 |
| Cumulative Proportion | 0.342050 | 0.599650 | 0.662690 | 0.7122200 | 0.7550000 | 0.7945700 | 0.8235500 | 0.8506100 | 0.8747500 | 0.8956500 | 0.9162800 | 0.9325900 | 0.9485900 | 0.9621700 | 0.9756300 | 0.988340 | 1.0000000 |
This provides us with 14 principal components that together explain over 95% of the total variance of the dataset. Needing 14 of a possible 17 components suggests the dimension reduction has limited explanatory power here, but I will still investigate its merits.
I am now going to plot PC1 against PC2 to see if there are obvious clusters before performing k-means clustering.
There are few discernible groups using just PCs 1 and 2 alongside ggplot2's geom_bin2d function to highlight density. I will use k-means clustering, an unsupervised machine-learning technique that groups points together by minimising within-cluster variance; it is an efficient and explainable algorithm, and the latter point is important when sharing findings with non-technical colleagues.
What this shows us is the loadings of each of the PCs: PC1 is largely influenced by several of the metrics, and PC2 by the rest. As we would expect, where there is a high pure_net_claims_ratio there is also a high pure_gross_claims_ratio, which matches our earlier charts. These two sets of loadings agree with what we saw in the correlation matrix, where there are examples of strong positive multicollinearity but no strong negative relationships.
K-means clustering is an iterative process that assigns each data point to a group (randomly at first) and incrementally re-clusters points based on their similarity to others. First, K 'centroids' are positioned randomly and each data point is assigned to its closest centroid (traditionally by Euclidean distance). Each centroid is then repositioned to the average position of its constituents, and the process repeats. It stops when the cluster assignments no longer change.
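The procedure above can be run in one call with `stats::kmeans`. This sketch uses a toy stand-in for the first two principal components (the real input would be the PC01/PC02 columns of pca_output_df), and k = 4 is illustrative.

```r
set.seed(42)  # initial centroids are random, so fix the seed for repeatability

# Toy stand-in with the same shape as the baked PCA scores
pca_scores <- data.frame(PC01 = rnorm(150), PC02 = rnorm(150))

# Run from 25 random starts and keep the solution with the lowest
# total within-cluster sum of squares
km <- stats::kmeans(pca_scores, centers = 4, nstart = 25)
table(km$cluster)  # cluster sizes
```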
The chart above shows how the groups form for different values of K. Due to the shape of the population (partly caused by cleaning outliers and using median values rather than investigating temporally), the groups become difficult to disentangle when K > 5, at least when looking only at PCs 1 and 2.
There is no obvious optimal K value for grouping these firms. By the time the WSS score reduces to a more acceptable level, there are so many groups that any inference is lost. For PCA to succeed in either inference or prediction, the data will need a more effective clean so the algorithms can perform better; my cleaning methods reduced the dataset, which diminishes the power of PCA. Nonetheless, we can provide this information to supervisors to help inform their choices.
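The WSS (elbow) calculation behind that judgement can be sketched as below, again on a toy stand-in for the PC scores rather than the real pca_output_df.

```r
set.seed(42)
# Toy stand-in for the first two principal components
pca_scores <- data.frame(PC01 = rnorm(150), PC02 = rnorm(150))

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  stats::kmeans(pca_scores, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```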
Once we can employ cloud technologies, this project will gain multiple advantages from integrating such tools. Utilising a platform such as Azure Data Factory, which focuses on hybrid data integration, will allow reduced siloing of data, more effective data ingestion, smarter data pipelines and a cost-effective approach to daily batch processing.
By having a central environment where one can see all databases, such as Azure Data Explorer, we reduce the issues of siloing. This makes parties more aware of all the data the Bank collects, even if access isn't immediately available for security reasons. It will improve collaboration across directorates as teams find overlaps in their workflows, and will reduce duplication of data collection. This matters for my report because, firstly, it grants a wider readership and benefits more parties; secondly, its visibility stops duplication by other colleagues; and lastly, it can be treated as a golden source, as there is no secondary data source feeding in, which promotes collaboration across the Bank.
Azure Data Factory will also allow stronger data-quality assurance processes: data ingestion will go through this system, so outliers or misreported data are more likely to be flagged, meaning my published report can be trusted.
For daily batch processing, the data is ingested and checked by Azure Data Factory, then loaded via custom ELT pipelines into the report hosted on a cloud platform, which is scheduled to update every morning. We can then monitor this process and identify any issues we may need to resolve. Overall, hosting the report on a cloud server reduces the need for manual processes from me or other colleagues, allowing us to focus on new projects and targets and so increasing productivity. As we have outsourced ingestion and server services, we can save costs and have greater confidence that the servers and databases will be consistently available, without outages or the issues that arise from internal key-person risk.